Method and apparatus for schema-driven XML parsing optimization

ABSTRACT

Schema-driven XML parsing techniques allow an XML parser to optimize its parsing process by composing and dynamically generating parsing code components based on the XML schema definition for the targeted XML document. These techniques reduce both the XML parsing time and the memory required during parsing. Further, a reconfigurable parser is provided which is guided during parsing of the XML document by XML element lexicographical information and state transition information extracted from a schema associated with the XML document. Pre-allocated element object pools may be provided based on the schema analysis to reduce the need for dynamic memory allocation and de-allocation operations.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. provisional application No. 60/516,037 filed on Oct. 30, 2003, incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to methods and systems for data exchange in an information processing system. In particular, the present invention relates to providing an optimized parser for processing a structured document (e.g., an XML document) in an information processing system.

2. Discussion of the Related Art

XML is a platform-independent text-based document format¹ designed to be used in structured documents maintained in an information processing system. XML documents (e.g., forms) have become the favored mechanism for data exchange among application programs sharing data over a network (e.g., the Internet). XML documents have the advantages that (a) the information in an XML document is extensible (i.e., an application program developer can define a document structure using, for example, an XML schema description), and (b) through the XML schema, an application developer can control the range of values that can be accepted for any XML element or attribute in the structured document. For example, in an XML schema-defined form for a pair of shoes, the application program developer may constrain the shoe size attribute accepted by the form to be between 5 and 12. As a result, the form would reject as invalid input a shoe size of 100.

¹ In this description, the platform-independent text-based document format means a text format for defining a document which is independent of the underlying software platform (e.g., the operating system), the underlying hardware platform, or both.

Because of these advantages, XML is widely used in consumer application programs. However, the human-readable text format that allows XML to be read and edited easily by a person using a word processor imposes additional processing overhead on the application program, because the structured data in an XML document must be parsed by the application into representations that the application program can manipulate in the computer. Parsing requires intensive computational resources, such as CPU cycles and memory bandwidth, as the application program processes the XML elements or attributes one character at a time, in addition to implementing the higher level processing requirements of the XML schema. In a typical XML document, there can be a large number of elements and attributes which are defined in the schema using different data types and constraints. Character-matching is not efficient in existing hardware implementations, such as those based on IA32 and ARM architectures.

Parsers for documents written in numerous languages have been developed and used throughout the history of computers. For example, the first widely accepted parsers (which also validate) for XML are based on the W3C Document Object Model (DOM). DOM renders the information in an XML document into a tree structure. Thus, a parser based on DOM constructs a “DOM tree” in memory to represent the XML document, as it reads the XML document. The DOM tree is then passed to the application program, which traverses the DOM tree to extract its required information. Constructing a DOM tree in memory is not only time-consuming, it also requires a large amount of memory. In fact, the memory occupied by a DOM tree is usually 5-10 times greater than that occupied by the original XML document. One optimization constructs a partial DOM tree in memory as needed to reduce the memory requirement and the processing time.

Alternatively, an XML document may be parsed based on a streaming model. Parsers using the streaming model include SAX and Pull. Under the streaming model, rather than a parse tree, a parser outputs a continuous stream of XML elements, together with the values of their attributes, as the XML document is parsed. Typically, such a parser reads from the XML document one XML element at a time, and passes to the consuming application the values of the element and their associated attributes. Although a streaming-based parser is efficient in its memory and processing speed requirements, such a parser merely tokenizes a string into segments of text without interpretation. The interpretation of data contained in each text segment is entirely left to the consuming application program. Thus, the burden of XML processing (which is to provide data in an XML document to the application program in a manner that can be readily used by the application program) is shifted from the parser to the consuming application program.

A parser may or may not validate an XML document. Validation is the process by which each parsed XML element is compared against its definition in an XML schema (e.g., an XML DTD file). Validation typically requires string pattern-matching as the validation program searches the multiple element definitions in the XML schema. A conventional approach to simplify validation is to convert the definitions of an XML schema into component models, expressed as a series of Java bean classes. An application program may then check the XML elements using methods provided in the Java bean classes. While schema conversion methods may speed up both the parsing and the validation processes to some limited degree, such conversions do not provide the fast string pattern-matching desired in XML parsing and validating.

As is apparent from the above, XML parsing involves a substantial amount of string-matching operations, which are the most CPU-intensive operations in XML parsing. Further, the memory requirements of parsing XML elements also lead to a substantial amount of inefficient memory allocation and de-allocation operations.

SUMMARY

The present invention provides an XML validating parser that can dynamically generate executable parsing code based on information extracted from an XML schema document that is either stored locally or obtained from a remote machine via a network. According to another aspect of the invention, a schema-based, reconfigurable parser is provided.

In one embodiment, each XML element of an XML document is parsed and validated using a dedicated executable parsing code (“parselet”), which navigates the structure of the XML element, its attribute values and constraints to validate the element. If the element is valid, the examined XML element is passed to a consuming application program requiring the XML document to be processed. Otherwise, an invalid exception is raised and the consuming application program is notified. Because parsing in this instance is performed in a compiled executable parselet, parsing is faster than the interpretive parsers of the prior art and the memory requirement for string matching can be much reduced.

According to one embodiment of the present invention, a lexicographical analysis of the XML elements is performed in advance for a given XML schema to provide: (1) state-transition sequence information, and (2) element and attribute lexicographical distance information. The transition sequence information can be used to guide the parser as to the XML elements that may be expected to follow according to the given schema. The element and attribute lexicographical distance measures a minimal lexicographical distance between two strings (i.e., the smallest indices in the strings sufficient to identify and distinguish the strings). This information is useful for guiding the parser to identify the element or attribute of the XML document using the minimal amount of string comparison.

In one embodiment of the present invention, pools of XML element objects having pre-determined element-attribute structures are created when the parser is instantiated, which are dynamically managed so that the sizes of the pools vary as needed. A schema analysis method provides memory requirement information to allow a parse tree of the XML document to be built in memory as a DOM object using objects in the element object pools. The schema analysis also provides information for managing the sizes of element pools and the type of element objects in each pool. Having element object designs in the pools alleviates the parser's memory management requirements.

A parselet of the present invention is a compiled, executable code that executes much faster than a corresponding interpretive code which examines the schema tree and XML document tree at run time. In addition, the present invention provides multiple parselets for different XML elements, so that parallel processing of multiple XML elements simultaneously is possible. The compiled parselets may be used for multiple XML documents based on the same XML schema.

An XML schema-driven parser of the present invention may be configurable to output XML elements in the form of a DOM tree (or a similar parse tree), or in a stream, as appropriate, according to the consuming application program. The parse tree of the present invention need not provide in memory an entire DOM tree. The present invention allows a partial parse tree to be constructed on demand, including as little as a single element. Thus, the memory requirement may be reduced significantly while still allowing application programs to access the XML document using DOM APIs.

By using minimal lexicographical distances between XML elements, string-matching operations are significantly reduced, thus significantly reducing the parser's demand on computational power.

By using pools of XML elements, a significant amount of dynamic memory allocation operations may be avoided during parsing. When an element object is no longer needed (e.g., in the case of streaming-based parsing), the element object is returned to the respective pool to be reused instead of being de-allocated. As a result of maintaining element pools, significant reductions in the requirements on CPU and memory resources are achieved. For example, the need for a garbage collector process (which does not reclaim memory from finalized objects immediately) may be avoided. Avoiding the need for a garbage collection process also reduces the requirements on the CPU.

The present invention is better understood upon consideration of the detailed description below and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of schema-driven parser generator system 120, according to one embodiment of the present invention.

FIG. 2 is a flow chart illustrating the operations of schema-driven parser generator system 120 under push mode.

FIG. 3 summarizes the operations of schema-driven parser generator system 120 under push mode.

FIG. 4 is a flow chart illustrating the operations of schema-driven parser generator system 120 under pull mode.

FIG. 5 summarizes the operations of schema-driven parser generator system 120 under pull mode.

FIG. 6 shows dynamically reconfigurable parser system 620, according to another aspect of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The present invention provides a validating parser that validates an XML document as it is parsed, thereby reducing the validating and parsing time requirements and the memory bandwidth requirement. The present invention may be applied to implement a parser code generator and a reconfigurable parser.

According to one embodiment of the present invention, based on a specific XML schema, a validating parser generator dynamically generates an executable parser code (“parselet”) for implementing parsing of a specific XML element in the XML schema. FIG. 1 is a block diagram of schema-driven parser generator system 120, according to one embodiment of the present invention. As shown in FIG. 1, schema-driven validating parser system 120 includes:

-   (1) schema reader 102, which reads one or more XML schema documents (e.g., XML schema document 100);
-   (2) XML reader 112, which reads XML documents (e.g., XML document 110);
-   (3) XML parser integrator 104, which coordinates among the different components of the XML parser (in this example, the XML parser consists of (a) XML parser integrator 104, (b) XML parser generator 106, and (c) parselets (e.g., parselet 108)); and
-   (4) XML output module 114, which outputs validated, parsed XML elements of the XML document for use by XML application program 116.

Parselets (e.g., parselet 108) are executable parsing codes created by XML parser generator 106 based on XML elements read by schema reader 102 from XML schema document 100. Each parselet is called to parse an XML element in an XML document (e.g., XML document 110) and to extract element and attribute values from the XML element and all its included XML elements. The parsed elements are output by XML output module 114 to XML application program 116. Schema-driven XML parser system 120 is especially beneficial when multiple XML documents are based on the same underlying XML schema, so that each parselet can be re-used multiple times.

XML schema document 100 may be a document retrieved from a local storage or from a remote storage via a network (e.g., the Internet) using one of various network transport protocols (e.g., FTP, HTTP, and SSL). Schema reader 102 reads the XML elements from XML schema document 100 one element at a time. In this embodiment, if an element is nested (i.e., it contains other XML elements), the contained elements are read before reading of the containing element is complete. Schema reader 102 may implement a push style or a pull style of reading XML elements. Under a push style, schema reader 102 continuously reads XML schema document 100 until an entire element is read, whereupon schema reader 102 notifies XML parser integrator 104. Under a pull style, XML parser integrator 104 requests that schema reader 102 read the next XML element from schema document 100.

When an entire element is completely read, XML parser integrator 104 causes XML parser generator 106 to generate a corresponding parselet for the XML element read. XML parser integrator 104 maintains a mapping table, which includes all relationships between a parselet, all its containing parselets and all the parselets it contains. In addition, the mapping table also records a name space and a qualified name for each parselet (a qualified name is typically a prefix encoding a path to the parselet).

To illustrate an application of the present invention, an example of an XML document is provided in a “purchaseOrder” document shown in Appendix A. As shown in Appendix A, “purchaseOrder” is an XML element which includes other XML elements “shipTo”, “billTo”, “comment” and “items”. Elements “shipTo” and “billTo” each include instances of elements “name”, “street”, “city”, “state” and “zip”. Element “items” may include one or more instances of element “item”. Element “item” may include instances of elements “productName”, “quantity”, “USPrice”, “comment” and “shipDate”. One or more attributes may be found in each element, whose values are provided by a string representing the appropriate data type. For example, element “purchaseOrder” includes attribute “orderDate” and element “shipTo” includes attribute “country”. Appendix A is the form of the document that is typically exchanged between the client (e.g., a web browser) and the application program. The corresponding schema document is shown in Appendix B.

Appendix B shows a schema which includes, at the top level, elements “purchaseOrder” and “comment”. Element “comment” is defined in the schema to be a string. Element “purchaseOrder” is defined in the schema to be an element of the data type “PurchaseOrderType”, which is defined to include elements, sequentially, “shipTo”, “billTo”, “comment” and “items”, and attribute “orderDate” as already seen in Appendix A. The term “sequence” indicates that the order in which the elements appear in the schema is expected to be the order in which those elements appear in the XML document. The schema further defines that the elements “shipTo” and “billTo” are both of the data type “USAddress”, which is defined to include elements “name”, “street”, “city”, “state” and “zip”, as also seen in Appendix A. The schema defines “name”, “street”, “city” and “state” each to be a string and “zip” to be of the data type “Decimal”. Similarly, element “items” is defined in the schema to include zero or more instances of element “item”. Element “item” includes, sequentially, elements “productName”, “quantity”, “USPrice”, “comment” and “shipDate”. Element “item” also has an attribute “partNum” which is of the data type “SKU”, a format for a part number which is also defined in the schema. The data type of element “item” is not provided a name. Using the information specified in the schema of Appendix B, parser generator 106 generates the parse code for parsing the elements as they appear in the XML document. One example of the parse code for element “purchaseOrder” is shown in Appendix C.

Thus, XML parser generator 106 generates a parselet for the element “purchaseOrder” which also includes code generated for parsing all the included elements. Note that, in this embodiment, as the simple data types “string”, “Decimal” and “Integer” and the various variations of the data type “Date” are encountered frequently in XML documents, the parse code for these data types is not generated specifically for every schema. Rather, a base class “Parselet” is provided, and the specifically generated parselets, such as “purchaseOrder”, are derived from the “Parselet” class, so that parse codes for these common data types are associated with every specifically generated parselet. An example of the “Parselet” class is provided in Appendix D.

In Appendix D, the methods for parsing data types “string”, “Decimal” and “Integer” and the various variations of “Date” are provided as “parseString”, “parseDecimal”, “parseInteger” and “parseDate”, respectively. In addition, methods for validating elements and attributes (e.g., “isElement” and “isAttribute”) are also provided in class “Parselet”. During parsing, “Parselet” keeps track of its progress through the XML document (i.e., where in the XML document the current text object being parsed is located) using the method “EEMoveCursor”. Error condition or “exception” handler “InvalidSchema” may be called from “Parselet”. An example of “InvalidSchema” is provided in Appendix E. As discussed above, the output from the parsing operation may be a DOM tree, which is built from a number of DOM nodes interconnected from a root node. An example of some pseudocode for creating the DOM tree is provided in the class “Node” in Appendix F.

Returning to Appendix C (i.e., the listing of generated parselet “purchaseOrder”), according to the structure of the XML document “purchaseOrder” as defined in the XML schema, parselet “purchaseOrder” parses both elements “purchaseOrder” and “comment” at the top level of the schema. To parse element “purchaseOrder”, the method “parsePurchaseOrderType” is called to handle the data type “PurchaseOrderType”. Method “parsePurchaseOrderType” parses, sequentially, the attribute “orderDate”, and each of elements “shipTo”, “billTo”, “comment” and “items”. As elements “shipTo” and “billTo” are both of the same data type “USAddress”, parsing of each element is handled by method “parseUSAddress”. Element “items” is handled by method “parseItems”, which is also generated according to the structure defined in the schema. As the data type of element “item” contained in element “items” is not given a name, XML parser generator 106 gives the method for handling this data type the name “parseUnnamed1”. Method “parseUnnamed1” parses, sequentially, the required attribute “partNum” and elements “productName”, “quantity”, “USPrice”, “comment” and “shipDate”. Note that the parsing code also validates each element, including testing whether the supplied values are within the accepted range of values for each element. Attribute “partNum” is parsed using the generated method “parseSKU”. As each element is successfully parsed, a node corresponding to the element is added to the parse tree using method “addChild” in the class “Node” defined in Appendix F.
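The range validation mentioned above (for example, the “quantity” element, which the schema restricts to a positive integer below 100) can be illustrated with a minimal sketch. The “Parselet” base class of Appendix D is not reproduced in this section, so the class name `RangeCheckedInt`, the method signature, and the use of `IllegalArgumentException` in place of “InvalidSchema” are all assumptions made for illustration only.

```java
/** Hypothetical helper; not the "parseInteger" method of the Parselet base class. */
final class RangeCheckedInt {
    private RangeCheckedInt() {}

    /**
     * Reads an unsigned decimal integer starting at doc[offset], stopping at the
     * first non-digit character, and rejects values outside [min, max].
     */
    static int parse(char[] doc, int offset, int min, int max) {
        int i = offset;
        long value = 0;
        while (i < doc.length && doc[i] >= '0' && doc[i] <= '9') {
            value = value * 10 + (doc[i] - '0');
            if (value > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("integer too large");
            }
            i++;
        }
        if (i == offset) {
            throw new IllegalArgumentException("expected a digit at offset " + offset);
        }
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                "value " + value + " outside [" + min + ", " + max + "]");
        }
        return (int) value;
    }
}
```

Under these assumptions, calling `RangeCheckedInt.parse(doc, offset, 1, 99)` with the offset positioned just after `<quantity>` accepts 42 and rejects 100, so the value is validated in the same pass that converts it.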

During actual parsing operation, when an element is read by XML reader 112 from XML document 110, a corresponding parselet (say, parselet 108) is selected from the mapping table, based on the name space, the element's qualified name, and the relationships involving parselet 108. When parselet 108 completes its parsing task, the parsed XML element is forwarded to parser integrator 104, which passes the parsed element to XML output module 114.
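As a rough illustration of the parselet selection just described, the mapping table can be pictured as a lookup keyed by name space and qualified name. The sketch below is an assumption-laden simplification: the names `ParseletTable`, `register` and `select`, the one-method `Parselet` interface, and the path-style qualified names are hypothetical, and the containment relationships that the mapping table also records are omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mapping table from (name space, qualified name) to a parselet. */
final class ParseletTable {
    /** Hypothetical stand-in for a generated parselet. */
    interface Parselet {
        String parse(char[] doc, int offset);
    }

    private final Map<String, Parselet> table = new HashMap<>();

    // A newline cannot occur in a name space URI or an XML name,
    // so it is a safe separator for the combined key.
    private static String key(String nameSpace, String qualifiedName) {
        return nameSpace + "\n" + qualifiedName;
    }

    void register(String nameSpace, String qualifiedName, Parselet parselet) {
        table.put(key(nameSpace, qualifiedName), parselet);
    }

    /** Returns the parselet registered for the element, or fails if none exists. */
    Parselet select(String nameSpace, String qualifiedName) {
        Parselet parselet = table.get(key(nameSpace, qualifiedName));
        if (parselet == null) {
            throw new IllegalArgumentException("no parselet for " + qualifiedName);
        }
        return parselet;
    }
}
```

A hash lookup keeps selection independent of the number of parselets, which matters when a schema defines many element types.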

According to one embodiment, parsed XML elements output from the parselets are validated against the elements' definitions in their respective schemas. Here, the term “parsed” may mean either (1) that the XML element has been converted into a structured data representation, such as a DOM tree, or (2) that the textual XML element has been validated by parselet 108 to conform to its corresponding definition in the schema document. In the case of a DOM tree, an application program can directly access the XML element through XML output module 114. In the alternative case, validating the textual XML element without more requires less processing time and less memory than building a DOM tree.

FIG. 2 is a flow chart illustrating the operations of schema-driven parser generator system 120 under push mode. As shown in FIG. 2, at step 200, XML reader 112 reads XML document 110 one element at a time. When an XML element is completely read, XML reader 112 generates an event, which is assigned a sequence number at step 202. The sequence number may be provided in ascending order by a counter. At step 204, XML reader 112 notifies XML parser integrator 104 of the event. Upon notification of the event, at step 206, XML parser integrator 104 examines and uses the name space and qualified name of the XML element associated with the event to select from the mapping table an appropriate parselet. At step 208, the selected parselet parses and validates the XML element. The parselet may provide its output in a DOM tree structure, or simply provide a textual representation of the XML element, according to the requirement of XML application program 116. At step 210, the selected parselet notifies XML parser integrator 104 of the parsed XML element. XML parser integrator 104 then passes the result of the parsing, together with information regarding the event, to XML output module 114 at step 212. XML output module 114 maintains a queue of parsed elements ordered by event sequence number. At step 214, XML output module 114 notifies application program 116 of the parsed element being added to the queue.
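The sequence-numbered output queue of steps 202 through 214 might be sketched as follows. This is only one possible reading: it assumes that sequence numbers start at 0 and that parsed elements may reach the output module out of order (for instance, if parselets run in parallel), so elements are held back until every earlier sequence number has been delivered. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical output queue that releases parsed elements in event order. */
final class OrderedOutput {
    private static final class Parsed {
        final long seq;
        final String element;

        Parsed(long seq, String element) {
            this.seq = seq;
            this.element = element;
        }
    }

    // Pending results, smallest sequence number first.
    private final PriorityQueue<Parsed> pending =
        new PriorityQueue<>((a, b) -> Long.compare(a.seq, b.seq));
    private long nextSeq = 0; // assumed starting value of the event counter

    /** Accepts one parsed element and returns the elements now deliverable in order. */
    List<String> offer(long seq, String element) {
        pending.add(new Parsed(seq, element));
        List<String> ready = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek().seq == nextSeq) {
            ready.add(pending.poll().element);
            nextSeq++;
        }
        return ready;
    }
}
```

In this sketch, the list returned by `offer` stands in for the notification to the application program at step 214.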

FIG. 3 summarizes the operations of schema-driven parser generator system 120 under push mode.

FIG. 4 is a flow chart illustrating the operations of schema-driven parser generator system 120 under pull mode. As shown in FIG. 4, in the pull mode, XML application program 116 initiates the parsing and validating process by asking XML output module 114 for the next XML element at step 300. XML output module 114 in turn requests the next XML element from XML parser integrator 104 at step 302. At step 304, XML parser integrator 104 then requests that XML reader 112 read the next XML element from XML document 110, which is accomplished at step 306. At step 308, XML reader 112 passes the XML element read to XML parser integrator 104, which selects at step 310 corresponding parselet 108 for validation and parsing based on the XML element's name space and qualified name. At step 312, selected parselet 108 then parses and validates the XML element. As in the push mode, the parsed XML element may be represented in a DOM tree structure, or in a textual representation, according to the requirements of application program 116. Parselet 108 provides the parsed XML element to XML parser integrator 104 at step 314. At step 316, XML parser integrator 104 provides the parsed XML element to XML output module 114, which provides the parsed XML element to XML application program 116 at step 318.
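The demand-driven character of the pull mode can be sketched with a standard Java iterator: nothing is read or parsed until the application asks for the next element. The sketch collapses the output module, integrator and parselet selection into one hypothetical class, `PullPipeline`, and represents the reader and the parselet by an `Iterator` and a `Function`; these are illustrative assumptions, not the structure of system 120.

```java
import java.util.Iterator;
import java.util.function.Function;

/** Hypothetical pull-style pipeline: each next() triggers one read and one parse. */
final class PullPipeline implements Iterator<String> {
    private final Iterator<String> reader;           // stands in for XML reader 112
    private final Function<String, String> parselet; // stands in for parselet 108

    PullPipeline(Iterator<String> reader, Function<String, String> parselet) {
        this.reader = reader;
        this.parselet = parselet;
    }

    @Override
    public boolean hasNext() {
        return reader.hasNext();
    }

    /** Reads the next raw element only when asked, then parses it. */
    @Override
    public String next() {
        return parselet.apply(reader.next());
    }
}
```

Because the consumer drives each step, at most one element needs to be in flight at a time, which is the memory advantage of the pull style.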

FIG. 5 summarizes the operations of schema-driven parser generator system 120 under pull mode.

According to another aspect of the present invention, FIG. 6 shows dynamically reconfigurable parser system 620. As shown in FIG. 6, reconfigurable XML parser 603 uses schema analyzer 601 to obtain, in lexicographical order, the XML elements defined in a schema document 600. Reconfigurable parser 603 then compares every pair of adjacent XML elements to determine a minimal lexicographical distance between them. These minimal lexicographical distances guide reconfigurable parser 603 in identifying XML elements during parsing. That is, reconfigurable parser 603 need only perform pattern-matching sufficient to recognize the attribute or element being parsed. In addition to minimal lexicographical distances, schema analyzer 601 also provides state-transition information, such as a list of possible next elements that may appear in the XML document, as determined based on the current state.
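One possible reading of the adjacent-pair comparison is sketched below: after sorting the names, the first index at which a name differs from each of its two neighbors determines how many leading characters must be compared to tell it apart from every other name. The class `LexDistance` and both method names are hypothetical, and the patent may define the distance differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical computation of distinguishing prefix lengths for element names. */
final class LexDistance {
    private LexDistance() {}

    /** Index of the first character at which a and b differ (their common-prefix length). */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /**
     * For each name, the number of leading characters that must be compared to
     * distinguish it from all other names. In sorted order, the longest shared
     * prefix is always with an adjacent neighbor, so only neighbors are examined.
     */
    static Map<String, Integer> distinguishingPrefixLengths(String... names) {
        String[] sorted = names.clone();
        Arrays.sort(sorted);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < sorted.length; i++) {
            int shared = 0;
            if (i > 0) {
                shared = Math.max(shared, firstDifference(sorted[i - 1], sorted[i]));
            }
            if (i + 1 < sorted.length) {
                shared = Math.max(shared, firstDifference(sorted[i], sorted[i + 1]));
            }
            // A name that is a prefix of its neighbor (e.g. "item" and "items") is
            // capped at its own length; the parser must then also check the
            // character that terminates the name.
            result.put(sorted[i], Math.min(shared + 1, sorted[i].length()));
        }
        return result;
    }
}
```

For the children of “USAddress”, this sketch reports that “city”, “name” and “zip” are identified by their first character alone, while “state” and “street” require three characters because they share the prefix “st”.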

Based on the information provided by schema analyzer 601, reconfigurable parser 603 parses XML document 604. According to the present invention, reconfigurable parser 603 manages a number of element pools 605, which are created at system initialization according to the elements expected to be encountered. Each element pool includes a number of pre-allocated data structures (“XML element objects”) created from a template of an expected XML element. As each element is successfully parsed, an XML element object is retrieved from the appropriate element pool and assigned as a node in a parse tree. Element pools 605 are resizable and can vary in size dynamically, as reconfigurable parser 603 parses XML document 604, according to the size and complexity of XML document 604.
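A minimal sketch of one such element pool follows, assuming a single pool per element type and a factory that plays the role of the element template. The class `ElementPool` and its methods are hypothetical; a real implementation would also reset an object's fields before reuse and might shrink the pool when demand falls.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

/** Hypothetical resizable pool of pre-allocated element objects. */
final class ElementPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> template; // creates one object of the expected element type

    ElementPool(Supplier<T> template, int initialSize) {
        this.template = template;
        for (int i = 0; i < initialSize; i++) {
            free.push(template.get());
        }
    }

    /** Hands out a pre-allocated object, allocating a new one only if the pool is empty. */
    T acquire() {
        return free.isEmpty() ? template.get() : free.pop();
    }

    /** Returns an object to the pool for reuse instead of leaving it to be de-allocated. */
    void release(T element) {
        free.push(element);
    }

    /** Number of objects currently available without allocation. */
    int available() {
        return free.size();
    }
}
```

Because released objects go back on the free list, a streaming parse that handles one element at a time can, in this sketch, run with a pool that never grows beyond its initial size.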

In one embodiment, application program 602 invokes XML parser 603 to parse XML document 604. Initially, reconfigurable parser 603 identifies the references in XML document 604 to XML elements defined in XML schema document 600. The schema references are provided to schema analyzer 601 to retrieve previously extracted lexicographical and state-transition information corresponding to these references. Reconfigurable parser 603 then parses the XML references, requests XML element objects corresponding to the parsed elements from XML element pools 605, fills the XML element objects returned with the parsed data, and links the XML element objects to a parse tree. When all XML references are parsed, the parse tree is provided to application program 602.
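The state-transition information retrieved from the schema analyzer can be pictured as a table listing, for each element, the elements that may legally follow it. The sketch below is a simplified assumption: the class `TransitionTable` and its methods are hypothetical, and the state is taken to be just the name of the element most recently parsed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical table of the elements allowed to follow each element. */
final class TransitionTable {
    private final Map<String, List<String>> followers = new HashMap<>();

    /** Records which elements may appear immediately after 'current'. */
    void allow(String current, String... next) {
        followers.put(current, Arrays.asList(next));
    }

    /** True if 'candidate' may follow 'current' according to the recorded transitions. */
    boolean mayFollow(String current, String candidate) {
        List<String> next = followers.get(current);
        return next != null && next.contains(candidate);
    }
}
```

For the “PurchaseOrderType” sequence, a table built this way would record that “billTo” follows “shipTo”, that either “comment” or “items” may follow “billTo” (since “comment” is optional), and that “items” follows “comment”. The parser then only needs to compare the incoming name against the short list of candidates for the current state.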

The above detailed description is provided to illustrate the specific embodiments of the present invention and is not intended to be limiting. Numerous modifications and variations within the present invention are possible. The present invention is set forth in the following claims.

APPENDIX A

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>

APPENDIX B

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <xsd:element name="state" type="xsd:string"/>
      <xsd:element name="zip" type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
  </xsd:complexType>
  <xsd:complexType name="Items">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="productName" type="xsd:string"/>
            <xsd:element name="quantity">
              <xsd:simpleType>
                <xsd:restriction base="xsd:positiveInteger">
                  <xsd:maxExclusive value="100"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <xsd:element name="USPrice" type="xsd:decimal"/>
            <xsd:element ref="comment" minOccurs="0"/>
            <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="partNum" type="SKU" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <!-- Stock Keeping Unit, a code for identifying products -->
  <xsd:simpleType name="SKU">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}-[A-Z]{2}"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss.examples;

import com.docomo.ss.InvalidSchema;
import com.docomo.ss.Node;
import com.docomo.ss.Parselet;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public class PurchaseOrder extends Parselet {
  private static char[] ename_purchaseOrder = {'p', 'u', 'r', 'c', 'h', 'a', 's', 'e', 'O', 'r', 'd',
      'e', 'r'};
  private static char[] ename_comment = {'c', 'o', 'm', 'm', 'e', 'n', 't'};

  public Node parse(char[] d, int off) {
    doc = d;
    offset = off;
    Node res = Node.create();
    int i = offset++;
    while (true) {
      try {
        if (doc[i] == 'p' && isElementFrom1(ename_purchaseOrder)) {
          increaseOffset(ename_purchaseOrder);
          res.addChild(parsePurchaseOrderType("purchaseOrder"));
        } else if (doc[i] == 'c' && isElementFrom1(ename_comment)) {
          increaseOffset(ename_comment);
          res.addChild(parseString(ename_comment, "comment"));
        } else throw new InvalidSchema();
        // try to check validity and exit condition here
      } catch (Exception e) {
        e.printStackTrace();
        break;
      }
    }
    return res;
  }

  private static char[] ename_shipTo = {'s', 'h', 'i', 'p', 'T', 'o'};
  private static char[] ename_billTo = {'b', 'i', 'l', 'l', 'T', 'o'};
  private static char[] ename_items = {'i', 't', 'e', 'm', 's'};
  private static char[] aname_orderDate = {'o', 'r', 'd', 'e', 'r', 'D', 'a', 't', 'e'};

  Node parsePurchaseOrderType(String name) throws InvalidSchema {
    Node res = Node.create();
    // Parse attributes
    if (isAttribute(aname_orderDate)) {
      increaseOffset(aname_orderDate, 2);
      res.addAttribute(parseDate(), "orderDate");
    }
    while (doc[offset++] != '<');
    // parse a sequence of elements
    if (isElement(ename_shipTo)) {
      increaseOffset(ename_shipTo);
      res.addChild(parseUSAddress("shipTo"));
    } else throw new InvalidSchema();
    if (isElement(ename_billTo)) {
      increaseOffset(ename_billTo);
      res.addChild(parseUSAddress("billTo"));
    } else throw new InvalidSchema();
    if (isElement(ename_comment)) {
      increaseOffset(ename_comment);
      res.addChild(parseString(ename_comment, "comment"));
    }
    if (isElement(ename_items)) {
      increaseOffset(ename_items);
      res.addChild(parseItems("items"));
    } else throw new InvalidSchema();
    // tail processing TODO
    if (!EEMoveCursor(ename_purchaseOrder))
      throw new InvalidSchema();
    return res;
  }

  private static char[] ename_name = {'n', 'a', 'm', 'e'};
  private static char[] ename_street = {'s', 't', 'r', 'e', 'e', 't'};
  private static char[] ename_city = {'c', 'i', 't', 'y'};
  private static char[] ename_state = {'s', 't', 'a', 't', 'e'};
  private static char[] ename_zip = {'z', 'i', 'p'};

  Node parseUSAddress(String name) throws InvalidSchema {
    Node res = Node.create();
    // Parse attributes TODO
    // Parse sequence
    if (isElement(ename_name)) {
      increaseOffset(ename_name);
      res.addChild(parseString(ename_name, "name"));
    } else throw new InvalidSchema();
    if (isElement(ename_street)) {
      increaseOffset(ename_street);
      res.addChild(parseString(ename_street, "street"));
    } else throw new InvalidSchema();
    if (isElement(ename_city)) {
      increaseOffset(ename_city);
      res.addChild(parseString(ename_city, "city"));
    } else throw new InvalidSchema();
    if (isElement(ename_state)) {
      increaseOffset(ename_state);
      res.addChild(parseString(ename_state, "state"));
    } else throw new InvalidSchema();
    if (isElement(ename_zip)) {
      increaseOffset(ename_zip);
      res.addChild(parseDecimal(ename_zip, "zip"));
    } else throw new InvalidSchema();
    // tail processing TODO
    return res;
  }

  private static char[] ename_item = {'i', 't', 'e', 'm'};

  Node parseItems(String name) throws InvalidSchema {
    Node res = Node.create();
    // parse sequence
    while (true) {
      if (isElement(ename_item)) {
        increaseOffset(ename_item);
        res.addChild(parseUnnamed1("item"));
      } else break;
    }
    // tail processing TODO
    return res;
  }

  private static char[] ename_productName = {'p', 'r', 'o', 'd', 'u', 'c', 't',
      'N', 'a', 'm', 'e'};
  private static char[] ename_quantity = {'q', 'u', 'a', 'n', 't', 'i', 't', 'y'};
  private static char[] ename_USPrice = {'U', 'S', 'P', 'r', 'i', 'c', 'e'};
  private static char[] ename_shipDate = {'s', 'h', 'i', 'p', 'D', 'a', 't', 'e'};

  Node parseUnnamed1(String name) throws InvalidSchema {
    Node res = Node.create();
    // processing attribute TODO
    // processing sequence
    if (isElement(ename_productName)) {
      increaseOffset(ename_productName);
      res.addChild(parseString(ename_productName, "productName"));
    } else throw new InvalidSchema();
    if (isElement(ename_quantity)) {
      increaseOffset(ename_quantity);
      res.addChild(parseInteger(1, 99, ename_quantity));
    } else throw new InvalidSchema();
    if (isElement(ename_USPrice)) {
      increaseOffset(ename_USPrice);
      res.addChild(parseDecimal(ename_USPrice, "USPrice"));
    } else throw new InvalidSchema();
    if (isElement(ename_comment)) {
      increaseOffset(ename_comment);
      res.addChild(parseString(ename_comment, "comment"));
    }
    if (isElement(ename_shipDate)) {
      increaseOffset(ename_shipDate);
      res.addChild(parseDate(ename_shipDate, "shipDate"));
    }
    // tail processing TODO
    return res;
  }
}
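
The repeated "if (isElement(...)) { ... } else throw new InvalidSchema()" blocks in the listing above follow directly from the xsd:sequence declarations: a required child produces a block with a throwing else branch, and a child declared with minOccurs="0" produces the same block without it. The following is a minimal sketch of how a generator might emit such blocks. The class SequenceCodeGen, its Child model, and the simplification that every parse method takes only the element name are assumptions for illustration and are not taken from this description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator: emits parselet source text for one xsd:sequence.
final class SequenceCodeGen {
    // One child element of the sequence (illustrative model, not from the source).
    static final class Child {
        final String name;       // element name, e.g. "shipTo"
        final String parseCall;  // parse method to invoke, e.g. "parseUSAddress"
        final boolean optional;  // true when the schema declares minOccurs="0"

        Child(String name, String parseCall, boolean optional) {
            this.name = name;
            this.parseCall = parseCall;
            this.optional = optional;
        }
    }

    // Emits one if-block per child, in schema order.
    static String generate(List<Child> children) {
        StringBuilder sb = new StringBuilder();
        for (Child c : children) {
            sb.append("if (isElement(ename_").append(c.name).append(")) {\n");
            sb.append("  increaseOffset(ename_").append(c.name).append(");\n");
            sb.append("  res.addChild(").append(c.parseCall)
              .append("(\"").append(c.name).append("\"));\n");
            // A required element that is absent makes the document invalid;
            // an optional element is simply skipped.
            sb.append(c.optional ? "}\n" : "} else throw new InvalidSchema();\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Child> purchaseOrderType = Arrays.asList(
            new Child("shipTo", "parseUSAddress", false),
            new Child("billTo", "parseUSAddress", false),
            new Child("items", "parseItems", false));
        System.out.print(generate(purchaseOrderType));
    }
}
```

Because the emitted blocks appear in schema order, the generated parser never has to search for which child may come next; the order is fixed in the code itself.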

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public abstract class Parselet {
  public static boolean verify = false;
  protected char[] doc;
  protected int offset;

  public abstract Node parse(char[] doc, int offset);

  // Compares all but the first character; the caller has already matched element[0].
  protected boolean isElementFrom1(char[] element) {
    for (int i = 1; i < element.length; i++)
      if (doc[offset + i] != element[i]) return false;
    return true;
  }

  // Matches the full element name followed by a space or the closing '>'.
  protected boolean isElement(char[] element) {
    for (int i = 0; i < element.length; i++)
      if (doc[offset + i] != element[i]) return false;
    return doc[offset + element.length] == ' ' ||
        doc[offset + element.length] == '>';
  }

  protected boolean isAttribute(char[] an) {
    for (int i = 0; i < an.length; i++)
      if (doc[offset + i] != an[i]) return false;
    return true;
  }

  protected Node parseString(char[] name, String ename) throws InvalidSchema {
    if (doc[offset] != '>') throw new InvalidSchema();
    int start = ++offset;
    while (doc[offset] != '<' || doc[offset - 1] == '&') offset++;
    // String(char[], int, int) takes a start index and a character count.
    Node res = Node.create(ename, new String(doc, start, offset - start));
    return res;
  }

  protected Node parseDecimal(char[] name, String ename) throws InvalidSchema {
    System.out.println("Parselet.parseDecimal not implemented!");
    return null;
  }

  protected Node parseInteger(int min, int max, char[] name) throws InvalidSchema {
    System.out.println("Parselet.parseInteger not implemented!");
    return null;
  }

  protected Node parseDate(char[] name, String ename) throws InvalidSchema {
    System.out.println("Parselet.parseDate not implemented!");
    return null;
  }

  protected Node parseDate() throws InvalidSchema {
    System.out.println("Parselet.parseDate not implemented!");
    return null;
  }

  protected void increaseOffset(char[] ename) {
    offset += ename.length;
  }

  protected void increaseOffset(char[] ename, int additional) {
    offset += ename.length + additional;
  }

  // Moves the cursor past the end tag "</ename>"; the tag is checked only when verify is set.
  protected boolean EEMoveCursor(char[] ename) {
    if (verify) {
      if (doc[offset] == '<' && doc[offset + 1] == '/') {
        int i;
        for (i = 0; i < ename.length; i++)
          if (doc[offset + i + 2] != ename[i]) return false;
        if (doc[offset + i + 2] != '>') return false;
      }
    }
    offset += ename.length + 3;
    return true;
  }
}
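
The listing above leaves parseInteger(min, max, name) unimplemented. The following is a minimal sketch of how such a range-checked integer parse might read digits directly from the character array, so that the schema bounds (1 and 99 for quantity) are enforced during parsing rather than in a separate validation pass. The class BoundedIntParser, its signature, and the use of IllegalArgumentException in place of InvalidSchema are assumptions made to keep the example self-contained.

```java
// Hypothetical stand-in for Parselet.parseInteger; not the implementation of this description.
final class BoundedIntParser {
    // Parses decimal digits starting at 'start' up to the next '<' and checks
    // that the value lies in [min, max]. Assumes 0 <= min <= max.
    static int parseBoundedInt(char[] doc, int start, int min, int max) {
        int i = start;
        long value = 0; // long accumulator so the multiply below cannot overflow
        while (i < doc.length && doc[i] != '<') {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new IllegalArgumentException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
            // Reject as soon as the upper bound is exceeded.
            if (value > max) {
                throw new IllegalArgumentException("value exceeds " + max);
            }
            i++;
        }
        if (i == start) {
            throw new IllegalArgumentException("empty integer");
        }
        if (value < min) {
            throw new IllegalArgumentException("value below " + min);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        String xml = "<quantity>42</quantity>";
        int start = xml.indexOf('>') + 1; // first character of the element content
        System.out.println(parseBoundedInt(xml.toCharArray(), start, 1, 99));
    }
}
```

Checking the upper bound inside the loop lets the parse stop at the first digit that makes the value invalid, which is the kind of early rejection a schema-aware parser can perform.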

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public class InvalidSchema extends Exception {
}

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public abstract class Node implements DOMNode {
  public static Node create() {
    System.out.println("Node.create not implemented!");
    return null;
  }

  public static Node create(String name, Object o) {
    System.out.println("Node.create not implemented!");
    return null;
  }

  public abstract void addChild(Node n);

  public abstract void addAttribute(Node n, String name);
}
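
Node.create is left unimplemented in the listing above. One way such a factory might be backed is by the pre-allocated element object pools mentioned in the abstract: objects are created once at initialization, retrieved and filled when an element is parsed, and returned to the pool afterwards. The following is a minimal sketch under those assumptions. The class ElementPool, its Element type, and the method names acquire and release are hypothetical and do not appear in this description.

```java
import java.util.ArrayDeque;

// Hypothetical pool of reusable element objects; illustrative only.
final class ElementPool {
    // Simplified element object holding a name and a text value.
    static final class Element {
        String name;
        String value;

        void clear() {
            name = null;
            value = null;
        }
    }

    private final ArrayDeque<Element> free = new ArrayDeque<>();

    // Pre-allocates 'capacity' element objects at initialization.
    ElementPool(int capacity) {
        for (int i = 0; i < capacity; i++) {
            free.push(new Element());
        }
    }

    // Retrieves an element object and fills it with the parsed values.
    // Falls back to a fresh allocation only when the pool is exhausted.
    Element acquire(String name, String value) {
        Element e = free.isEmpty() ? new Element() : free.pop();
        e.name = name;
        e.value = value;
        return e;
    }

    // Returns an element object to the pool for reuse.
    void release(Element e) {
        e.clear();
        free.push(e);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        ElementPool pool = new ElementPool(4);
        Element city = pool.acquire("city", "Mill Valley");
        System.out.println(city.name + "=" + city.value + ", free=" + pool.available());
        pool.release(city);
        System.out.println("free=" + pool.available());
    }
}
```

Sizing each pool from the schema (for example, five string elements per USAddress) is what would allow most allocations to be moved out of the parsing loop.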

1. A method for parsing an XML document, comprising: performing an analysis of the XML document from an XML schema associated with the XML document to extract one or more relationships between XML elements included in the XML document; and parsing the XML elements of the XML document guided by relationships extracted in the analysis.
2. A method as in claim 1, wherein the relationships extracted from the analysis comprise a lexicographical distance between XML elements.
3. A method as in claim 1, wherein the relationships extracted from the analysis comprise state-transition information.
4. A method as in claim 1, further comprising providing element object pools that are created upon system initialization.
5. A method as in claim 4, further comprising, upon parsing an XML element: retrieving a corresponding element object from the element object pools; and filling the element object with values extracted from the parsed XML element.
6. A method as in claim 5, further comprising providing the element object as a node in a parse tree.
7. A method as in claim 5, wherein the element object is returned to the element object pools.
8. A method as in claim 4, wherein each element object in the element object pools corresponds to an expected data structure of an XML element defined in the XML schema.
9. A reconfigurable parser for an XML document, comprising: an analyzer for extracting from an XML schema associated with the XML document relationships between XML elements included in the XML document; and a parser of the XML elements of the XML document guided by the relationships extracted by the analyzer.
10. A reconfigurable parser as in claim 9, wherein the relationships extracted by the analyzer comprise a lexicographical distance between XML elements.
11. A reconfigurable parser as in claim 9, wherein the relationships extracted by the analyzer comprise state-transition information.
12. A reconfigurable parser as in claim 9, further comprising element object pools that are created upon system initialization.
13. A reconfigurable parser as in claim 12, further comprising: a selector for retrieving a corresponding element object from the element object pools, upon successfully parsing an XML element; and a writer for filling in the element object with values extracted from the parsed XML element.
14. A reconfigurable parser as in claim 13, further comprising a parse tree constructor which receives the element object as a node in a parse tree.
15. A reconfigurable parser as in claim 13, wherein the element object is returned to the element object pools.
16. A reconfigurable parser as in claim 12, wherein each element object in the element object pools corresponds to an expected data structure of an XML element defined in the XML schema.
17. A method for efficiently parsing an XML document, comprising: analyzing a schema associated with the XML document to extract data structures of XML elements of the XML document; generating parse code for each data structure of the XML elements; and parsing the XML elements using the generated parse code as the XML document is read.
18. A method as in claim 17, wherein the generated parse code is compiled.
19. A method as in claim 17, further comprising reading the XML elements from the XML document one element at a time.
20. A method as in claim 19, wherein the XML elements are read into the parser according to a push model.
21. A method as in claim 19, wherein the XML elements are read into the parser according to a pull model.
22. A method as in claim 17, further comprising validating the XML elements against the XML schema.
23. A method as in claim 17, further comprising providing the parsed XML elements in a parse tree.
24. A method as in claim 17, further comprising providing the parsed XML elements one at a time in a continuous stream.
25. A parser for an XML document, comprising: a schema analyzer for extracting data structures of XML elements from a schema associated with the XML document; a parse code generator that generates a parse code for each data structure of the XML elements; and a parser integrator that invokes a corresponding parse code in response to each XML element encountered as the XML document is read.
26. A parser as in claim 25, wherein the generated parse code is compiled.
27. A parser as in claim 25, further comprising an XML reader that reads the XML elements from the XML document one element at a time.
28. A parser as in claim 27, wherein the XML elements are read into the parser according to a push model.
29. A parser as in claim 27, wherein the XML elements are read into the parser according to a pull model.
30. A parser as in claim 25, wherein the parse code validates the XML elements against the XML schema.
31. A parser as in claim 25, further comprising an output module that provides the parsed XML elements in a parse tree.
32. A parser as in claim 25, further comprising an output module that provides the parsed XML elements one at a time in a continuous stream.